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We consider the problem of computing reachability probabilities: given a Markov chain, an initial 
state of the Markov chain, and a set of goal states of the Markov chain, what is the probability of 
reaching any of the goal states from the initial state? This problem can be reduced to solving a linear 
equation A • x = b for x, where A is a matrix and b is a vector. We consider two iterative methods 
to solve the linear equation: the Jacobi method and the biconjugate gradient stabilized (BiCGStab) 
method. For both methods, a sequential and a parallel version have been implemented. The parallel 
versions have been implemented on the compute unified device architecture (CUDA) so that they 
can be run on a NVIDIA graphics processing unit (GPU). From our experiments we conclude that as 
the size of the matrix increases, the CUDA implementations outperform the sequential implementa- 
tions. Furthermore, the BiCGStab method performs better than the Jacobi method for dense matrices, 
whereas the Jacobi method does better for sparse ones. Since the reachability probabilities problem 
plays a key role in probabilistic model checking, we also compared the implementations for matrices 
obtained from a probabilistic model checker Our experiments support the conjecture by Bosnacki 
et al. that the Jacobi method is superior to Krylov subspace methods, a class to which the BiCGStab 
method belongs, for probabilistic model checking. 

1 Introduction 

Given a Markov chain, an initial state of the Markov chain, and a set of goal states of the Markov chain, 
we are interested in the probability of reaching any of the goal states from the initial state. This proba- 
bility is known as the reachability probability. These reachability probabilities play a key role in several 
fields, including probabilistic model checking (see, for example, 1 1 , Chapter 10]) and performance eval- 
uation (see, for example, Il9ll). 

As we will sketch in Section [2] computing reachability probabilities can be reduced to solving a 
linear equation of the form A • x = b for x, where A is an « x «-matrix and b is an w-vector. Although the 
equation can be solved by inverting the matrix A, such an approach becomes infeasible for large matrices 
due to the computational complexity of matrix inversion. For instance, Gauss-Jordan elimination has 
time complexity 0{n^) (see, for example, lfT2l Chapter 2]). For large matrices, iterative methods are used 
instead. 

The iterative methods compute successive approximations to obtain a more accurate solution to the 
linear system at each iteration. For an overview of linear methods, we refer the reader to, for example, E 
Chapter 2]. In this paper, we consider two linear methods, namely the Jacobi method and the biconjugate 
gradient stabilized (BiCGStab) method. 

We have implemented a sequential version and a parallel version of the Jacobi and BiCGStab method. 
The sequential versions have been implemented in C. The parallel versions have been implemented 
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using NVIDIA's compute unified device arcliitecture (CUDA). It allows us to run C code on a graphics 
processing unit (GPU) with hundreds of cores. Currently, CUDA is only supported by NVIDIA CPUs. 

Our CUDA implementation of the lacobi method is based on the one described by Bosnacki et al. 
in imm. When we started this research, we were aware of the paper by Gaikwad and Toke [7| that 
mentions a CUDA implementation of the BiCGStab method. Since we did not have access to their code, 
we implemented the BiCGStab method in CUDA ourselves. 

To compare the performance of our four implementations, we constructed three sets of tests. First 
of all, we randomly generated matrices with varying sizes and densities. Our experiments show that 
the BiCGStab method is superior to the Jacobi method for denser matrices. They also demonstrate a 
consistent performance benefit from using CUDA to implement the Jacobi method. However, we observe 
that the CUDA version of the BiCGStab method is only beneficial for larger, denser matrices. For the 
smaller and sparser matrices, the sequential version of the BiCGStab method outperforms the CUDA 
version. 

Secondly, we used an extension of the model checker Java PathFinder (JPF) to generate matrices. 
JPF can check properties of Java code, such as the absence of uncaught exceptions. Zhang fT4ll created 
a probabilistic extension of JPF that can check properties of randomized sequential algorithms. We 
used this extension to generate transition probability matrices corresponding to the Java code of two 
randomized sequential algorithms. The Jacobi method performed better than the BiCGStab method for 
these matrices. This supports the conjecture by Bosnacki et al. H HI that the Jacobi method is superior 
to Krylov subspace methods, a class to which the BiCGStab method belongs, for probabilistic model 
checking. 

Finally, we randomly generated matrices with the same sizes and densities as the matrices produced 
by JPF. We obtained very similar results. This suggests that size and density are the main determinants 
of which implementation performs best on probabilistic model checking data, and whether CUDA will 
be beneficial, rather than other properties unique to matrices found in probabilistic model checking. 



2 Reachability Probabilities of a Markov Chain 

In this paper, the term "Markov chain" refers to a discrete-time Markov chain. Below, we review the 
well-known problem of computing the reachability probabilities of a Markov chain. Our presentation is 
based on |[T] Section 10.1]. A Markov chain is a triple ^ = {S,'P,sq) consisting of 

• a nonempty and finite set S of states, and 

• a transition probability function P : S x 5 — )• [0, 1] satisfying for all s £ S, 

s'es 

• and an initial state sq £ S. 

The transition probability function can be represented as a matrix. This matrix has rows and columns 
corresponding to the states of the Markov chain, and entries representing the probabilities of the transi- 
tions between these states. For example, the probability transition function of the Markov chain depicted 
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A path of a Markov chain ^ is an infinite sequence 51^2 ... of states such that ¥{si,Si+i) > for all 
/ > 1. We denote the set of paths of ^ that start in state s by Paths (s). For the above Markov chain, the 
set of paths starting in state is { (02)" 1® | « G N} U { (02)« 3™ | « G N} U {(02)®}. 

To define the probability of events such as eventually reaching a set of states from the initial state, 
we associate a probability space with a Markov chain ^ and a state s. Recall that a probability space 
consists of a set, a a-algebra, and a probability measure. In this case, the set is Paths{s). The a-algebra 
is generated by the so-called cylinder sets. Given a finite sequence of states s\ . . .Sn, its cylinder set 
Cyl{s\ . . . 5„) is defined by 

Cyl{s\ . . .Sn) = { 71 G Paths(s) \ si .. .s„ is a. prefix of tt}. 

For the above Markov chain, we have that Cyl{03) = {0 3®}. We define 

Pr{Cyl{si...Sn))= n P(^n^/+i)- 

l<i<n 

Recall that Pr can be uniquely extended to a probability measure on the a-algebra generated by the 
cylinder sets. For the above Markov chain, we have that Pr{Cyl{03)) = 0.5. 



2.1 Property of Interest 

In the following, we are interested in two particular events. First of all, given a Markov chain a 
state s of the Markov chain and a set of states GS, known as the goal states, of the Markov chain, we 
are interested in the probability of reaching a state in GS in one transition when starting in s. We denote 
the set of paths starting in s that reach a state in GS in one transition by { tt G Paths{s) \ n \= Q)GS}. 
This set can be shown to be measurable, that is, it belongs to the a-algebra. Its probability we denote by 
Pr{s h OGS). 

Secondly, given a Markov chain a state s of the Markov chain and a set of goal states GS, 
we are interested in the probability of reaching a state in GS in zero or more transitions when starting 
in s. We denote the set of paths starting in s that reach a state in GS in zero or more transitions by 
{ TT G Paths{s) I K 1= ()GS}. Also this set can be shown to be measurable. Its probabiUty we denote by 
Pr{s ^ OGS). 

The problem we are examining in this paper is the following: Given a Markov chain and set of 
goal states GS, what is the probability to reach a state in GS in zero or more transitions from the initial 
state? In other words, what is the probability of states in GS eventually being reached? That is, we want 
to compute Pr{sQ |= ()GS), where is the initial state of This is what is referred to as computing 
reachability probabilities in fTJ Chapter 10.1.1]. 
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2.2 Partitioning the Set of States 

To compute the reachability probabiUties, one usually first partitions the set of states of the Markov chain 
into three parts, based on their probability to reach the set GS of goal states: 

• S=i = {seS\ Pr{s ^ 0G5) = 1 } 

• 5=0 = e 5 I Pr{s \= OGS) = 0} 

• 5? = 5\(5=iU5=o) 

The partitioning of the set of states can be done easily by considering the underlying digraph of the 
Markov chain. The vertices of this digraph are the states of the Markov chain. There is an edge from 
state s to state s' if and only if P(5, s') > 0. Using graph algorithms, such as depth-first-search or breadth- 
first-search, the set of states can be partitioned as follows: 

• To find S=o, determine the set of all states that can reach GS (including the states in GS). The 
complement of this set is 5=o. 

• To find S=\, determine the set of all states that can reach 5=o- The complement of this set is S^i. 

• To find 5?, simply take the complement of 5=i U 5=o- 

Consider the above Markov chain and let 3 be the only goal state. Then S=o = {1}, 5=1 = {3} and 
5? = {0,2}. 



2.3 Computing Reachability Probabilities by State 

To compute the reachability probabiUties, one must determine the probability of the initial state leading 
to a state in GS, that is, one has to compute Pr{so |= (}GS). To do so, one must also determine the 
probabilities of reaching a state in GS from other states. 

For each state s we compute Xg, which is the probability of reaching GS from s, that is, 
Xs = Pr{s \= OGS). The values of Xg, for s e S, can be expressed as a vector, x. For any state s G 5=i, by 
definition = 1. Similarly, for each s G 5=o> = 0. So once the states have been partitioned into the 
three sets described above, the only values of x that need to be calculated are {x^ | j G 5? }. These values 
satisfy the following equation: 

x,= £ Pis,s')-Xg>+ P{s,s'). 

s'eS\GS .s'gGS 

So, we will create a matrix A, which includes only the transition probabilities between states in 5?. For 
each s, s' G 5?, A^ ^/ = P{s,s'). To aid in calculations, we will also create a vector b. For each s G 5?, 
bg = Pr{s \= QGS), that is, the probabihty of a state in GS being reached from s in one transition. 
Consider the above Markov chain and let 3 be the only goal state. Then states 1 and 3 are excluded 
because they do not belong to 5?, and 



A = 
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andb = 
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The equation for x above can be written as x = A • x + b. Rearranged, this becomes (I — A) • x = b, 
where I is an identity matrix. A and b are already known from the Markov chain. So, x can be found by 
solving the Unear equation (I — A) • x = b for x. 

There are several ways to solve the equation (I — A) ■ x = b for x. Most obviously, one can find 
(I — A)~^ However, methods to find matrix inverses tend to have high computational complexity and. 
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hence, for large matrices this becomes infeasible. For instance, Gauss-Jordan elimination has time com- 
plexity 0{n^) (see, for example, [ 12, Chapter 2]). 

Iterative approximation methods find solutions that are within a specified margin of error of the exact 
solutions, and can work much more quickly. The methods used in this paper are in this category, and will 
be discussed in more detail next. 

3 Iterative Methods 

Solving the linear equation (I — A) • x = b for x can be very time-consuming, especially when A is 
large. To solve this equation, there are numerous iterative methods. These methods compute successive 
approximations to obtain a more accurate solution to the linear system at each iteration. For an overview 
of linear methods, we refer the reader to, for example, [3 , Chapter 2]. 

These iterative methods can be classified into two groups: the stationary methods and the nonstation- 
ary ones. In stationary methods, the same information is used in each iteration. As a consequence, these 
methods are usually easier to understand and implement. However, convergence of these methods may 
be slow. In this paper, we consider one stationary linear method, namely the Jacobi method. 

In nonstationary methods, the information used may change per iteration. These methods are usually 
more intricate and harder to implement, but often give rise to faster convergence. In this paper we focus 
on one particular nonstationary linear method, namely the biconjugate gradient stabilized (BiCGStab) 
method. 

The BiCGStab method was developed by Van der Vorst |[T3l . It is a type of Krylov subspace method. 
Unlike most of these methods, it can solve non-symmetric linear systems, and was designed to minimize 
the effect of rounding errors. If exact arithmetic is used, it will terminate in at most n iterations for an 
nxn matrix [13, page 636]. In practice, it often requires much fewer iterations to find an approximate 
solution. 

The un-preconditioned version of this method was used, as the sequential and GPU performance of 
the preconditioning method would be a separate issue. For some matrices a preconditioner is necessary 
to use the BiCGStab method, but that is not the case for the data encountered here. Buchholz [6| did a 
limited comparison between a preconditioned and un-preconditioned version of the BiCGStab method 
specifically on matrices that represent stochastic processes, and it does not show a large performance 
difference. 

3.1 The Jacobi Method 

Given the matrix A and the vector b, the Jacobi method returns an approximate solution for x in the 
linear equation A • x = b. Below, we only present its pseudocode. For a detailed discussion of the Jacobi 
method, we refer the reader to, for example, [Sj Section 2.2.1]. 

Jacobi (A,b) : 

X := random vector 



repeat 



X 



:= X 



for all 



i = \ ...n do 



until X is 
return x 



accurate enough 
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3.2 The BiCGStab Method 

Like the Jacobi method, the BiCGStab method returns an approximate solution to A • x = b for a given 
matrix A and vector b. Again, we only present the pseudocode. For more details, we refer the reader to 
the highly cited paper [|13il in which Van der Vorst introduced the method. 

BiCGStab(A,b): 

X := random vector 

r := b — A x 

q := random vector such that q-Vy^O 
y :- I; a := 1; w := 1; v := 0; p := 
repeat 

y := y 
y ■= q r 



P := r+|^-(p-wv) 
V := A p 



a 



y 

q V 

s := r — a-y 
i := A s 

w := 

X := x + a-p + w-s 

r := s — w-t 
until X is accurate enough 
return x 



4 General Purpose Graphics Processing Units (GPGPU) 

Graphics processing units (GPUs) were primarily developed to improve the performance of graphics- 
intensive programs. The market driving their production has traditionally been video game players. They 
are a throughput-oriented technology, optimized for data-intensive calculations in which many identical 
operations can be done in parallel on different data. Unlike most commercially available multi-core 
central processing units (CPUs), which normally run up to eight threads in parallel, GPUs are designed 
to run hundreds of threads in parallel. They have many more cores than CPUs, but these cores are 
generally less powerful. GPUs sacrifice latency and single-thread performance to achieve higher data 
throughput [8|. 

Originally, GPU APIs were designed exclusively for graphics use, and their underlying parallel ca- 
pabilities were difficult to access for other types of computation. This changed in 2006 when NVIDIA 
released CUDA, the first architecture and API designed to allow GPUs to be used for a wider variety of 
applications [10, Chapter 1]. 

All parallel algorithms discussed in this paper were implemented using CUDA. Different generations 
of CUDA-enabled graphics cards have different compute capability levels, which indicate the sets of 
features that they support. The graphics card used here is the NVIDIA GTX260, which has compute 
capability 1.3. Cards with higher compute capability have been released by NVIDIA since the GTX260 
was developed. 

The GTX260 has 896MB global on-card memory. This GPU supports double-precision floating- 
point operations, which is required for BiCGStab. Lower compute capabilities (1.2 and below) do not 
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support this. The GTX260 does not support atomic floating-point operations. This limits the way op- 
erations that require dot-products and other row sums can be separated into threads. Since these sums 
require all elements of a row to be summed to a global variable, and atomic addition can not be used to 
protect the integrity of the sum, the structure of the threads must do so. Here, we have simply created 
one thread per row and done these additions sequentially. The next generation of NVIDIA GPUs (com- 
pute capability 2.0 and higher) support atomic floating-point addition, though only on single-precision 
numbers. 

Current GPUs have limited memory to store matrix data. For all but the smallest matrices, some sort 
of compression is necessary so that they can be transferred to the GPU in their entirety. 

For this project, data is stored using a compressed row storage, as described in f3^, page 57]. In 
this format, only the non-zero elements of the matrix are recorded. The size of an n xn matrix with 
m non-zero elements compressed in this manner is 0{n + m), whereas uncompressed it is 0{n^). This 
representation saves significant space when used to store sparse matrices. 

A compressed row matrix consists of three vectors. The vector rstart contains integers and has 
length n + \, where rstart i is the total number of non-zero elements in the first / rows. The vector col also 
contains integers and has length m. It stores the column position of each non-zero element. The vector 
nonzero contains floating points and has length m. It stores the non-zero elements of the matrix. 

For example, the matrix 

[10 
2 
3 

is represented by the vectors 



rstart 


[0, 1, 2, 3] 


col 


[0, 2, 2] 


nonzero 


[1,2, 3] 



5 Parallel Implementations in CUDA 

Next, we present parallel versions of the Jacobi method and the BiCGStab method. These parallel 
versions are implemented in CUDA. 

5.1 The Jacobi Method 

The Jacobi method was parallelized as in H. Given an « x ^-matrix A and an «- vector b, in the parallel 
implementation of the Jacobi method n threads are created. For each thread /, where <i < n, the 
following algorithm is used. Essentially, one thread traverses each row of the matrix A to compute the 
corresponding element of x. Thus, each element of x is computed in paralUel. 

/ := thread id 
if J = then 

terminate := true 
if i<n then 

d := bi 

I : = rstarti 

h := rstart i+\ — 1 

for all j = l...h do 
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d : = d — nonzero j ■ Xcoi 

^' • A,, 

if Xi is not accurate enough then 
terminate = false 

The variable terminate is shared by all the thi^eads to determine when the iteration can stop. Note that 



line 5-10 of the code above corresponds to line 5-6 of the Jacobi method in Section 3.1 Thus, this 
algorithm represents one iteration of the Jacobi method. The main program repeatedly launches the 
corresponding code on the GPU, waiting between executions for all threads to complete, until x reaches 
the desired accuracy. 



5.2 The BiCGStab Method 

Each iteration of the BiCGStab method consists of several matrix and vector multiplications, that each 
result in a vector. As most of these must be done in order, they were parallelized individually by creating 
one thread for each element of the resulting vectors. 

For instance, the operation in line 15 is split between n threads, so for each < / < « a thread does 
the following: 

Xi =Xi + a- pi + w ■ Si 

And thus each element of the vector is calculated in parallel. The matrix operation in line 10 is done as 
follows for each of the n threads: 

Vi = Ar p 

where A, denotes the row of the matrix A. Dot products are done sequentially. It would be possible 
to increase parallelism by splitting these into multiple sums done in parallel. This may require too much 
overhead to result in a significant performance gain. 

Currently, CUDA requires all threads running on a GPU to execute the same code. There are some 
separate steps of the BiCGStab method which could be executed in parallel, but this is not possible with 
a single GPU of the type used here. This may be possible using the next generation of GPUs, or multiple 
GPUs. However, most steps of the algorithm must be done in sequence. 

Below is an abbreviated version of the CUDA code used to implement BiCGStab. To save space, 
non-essential code has been removed, and some steps are summarized in square brackets. Kernel num- 
bering corresponds to the line numbers of the algorithm in Section [T2| Kernels for subsequent steps are 
combined when they require the same number of threads. Sequential steps are done on the GPU with a 
single thread, since this avoids time-consuming data transfers between the GPU and host computer. 

The first portion of the code, below, executes on the CPU. Pointers prefixed by d- indicate data stored 
on the GPU, and n is the dimension of the matrix. The matrix A is represented in row-compressed form 
on the GPU as d-col, d-rstart and djionzero. 

int terminate = 0; 

for(int i = 0; i <= max_iterations && ! terminate ; i++){ 

[d_yprime = d_y, switch pointers without moving data] 
step8Kernel<«l , 1»> (d_B, d_y, d_yprime, d_a, d_w, d_q, d_r, n) ; 
step9Kernel<«grid,block>»(d_p, d_r, d_B, d_w, d_v, n, blocksz) ; 

matrixVectorMult<«grid,block>» (d_col , d_rstart, d_nonzero, d_p, d_v, n, blocksz); 
stepllKernel«<l , l>»(d_a, d_y, d_q, d_v, n) ; 
stepl2Kernel«<grid,block»> (d_s , d_r, d_a, d_v, n, blocksz); 

matrixVectorMult<«grid,block>» (d_col , d_rstart, d_nonzero, d_s, d_t , n, blocksz); 
stepl4Kernel«<l,l»>(d_w, d_t , d_s, n) ; 
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stepl5_16Kernel«<grid,block»>(d_x, d_a, d_p, d_w, d_s, d_r, d_t , n, blocksz, d_terminate) ; 
[terminate = d_terminate , transferred from GPU to host machine] 

} 

The kernel methods below execute on the GPU. The kernels for steps 12 are 14 are very similar 
to previous steps, and are excluded. The matrixVectorMult method, not shown, is a generic parallel 
matrix- vector multiplication kernel. 

global static void stepSKernel (double *d_B, double *d_y, double *d_yprime, double *d_a, 

double *d_w, double *d_q, double *d_r, int n){ 
double result = 0; 

for(int i = 0; i < n; i++){ result += d_q[i] * d_r [i] ; } 
*d_y = result; 

*d_B = (*d_y / *d_yprime) * (*d_a / *d_w) ; //prepare scalar for step 9 

> 

global static void step9Kernel (double *d_p, double *d_r, double *d_B, double *d_w, 

double *d_v, int n, int blocksz) { 

int i = blockldx.x * blocksz + threadldx.x; //thread index 

if(i < n){ d_p[i] = d_r[i] + *d_B * ( d_p[i] - *d_w * d_v[i]); } 

} 

global static void stepllKernel (double *d_a, double *d_y, double *d_q, double *d_v, int n){ 

double dot_q_v = 0; //dot product of q,v 

for(int i = 0; i < n; i++){ dot_q_v += d_q[i] * d_v [i] ; } 
*d_a = *d_y / dot_q_v; 

} 

global static void stepl5_16Kernel (double *d_x, double *d_a, double *d_p, double *d_w, 

double *d_s, double *d_r, double *d_t , int n, int blocksz, int* d_Terminate) { 

int i = blockldx.x * blocksz + threadldx.x; //thread index 

if(i == 0)-[ *d_Terminate = 1; } //one thread sets d_Terminate 

if(i < n){ 

d_x[i] = d_x[i] + *d_a * d_p[i] + *d_w * d_s[i]; //step 15 
double diff = abs(d_s[i]); 

if(diff > ERROR) { *d_Terminate = 0;} //stop when residual is near zero 
d_r[i] = d_s[i] - *d_w * d_t [i] ; //step 16 

> 

> 

6 Performance on Randomly Generated Matrices 

For these tests, random matrices were generated. The entries were random positive integers, placed in 
random locations. The matrices were then modified to have non-zero diagonal entries and be diagonally- 
dominant, which ensured that both the Jacobi and BiCGStab methods were able to solve equations con- 
taining them. For each matrix A, a vector b of random integers was also created. This formed the 
equation A • x = b, to be solved for x. Each trial used a newly-generated matrix and vector. 

Figure [T] shows the performances of the four implementations on random matrices of several sizes, 
and varying densities. The matrix densities are similar to those encountered in probabilistic model check- 
ing. These graphs indicate that the implementations' relative performances change their order as the 
density increases. Furthermore, as the size of the matrix increases, the density at which the performance 
order changes becomes lower. Thus, which implementation performs best depends on the size and den- 
sity of the matrix it is being used on. 
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Generally, the graphs in Figure[T]show that the BiCGStab method is superior to the Jacobi method for 
denser matrices, but the Jacobi method performs best for very sparse matrices. They also demonstrate 
a consistent performance benefit from using CUDA to implement the Jacobi method. However, the 
third graph shows that the CUDA version of the BiCGStab method is only beneficial for larger, denser 
matrices. For other matrices, the sequential version of the BiCGStab method outperforms the CUDA 
version. 




Figure 1: The effect of varying density on performance, for three matrix sizes. A logarithmic scale is 
used on the y-axis. Average times are based on 20 trials. The standard deviations of the times measured 
are too small to be shown. 



To confirm this. Figure |2] and Figure [3] show the implementations' performances on random matri- 
ces with 10% density. As the sizes of these relatively dense matrices increase, the CUDA versions of 
both implementations increasingly outperform their sequential counterparts, and the BiCGStab method 
significantly outperforms the Jacobi method. 



Performance - Random Matrices with 10% Density 
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Figure 2: Performance on randomized matrices of varying sizes, all with 10% density. 
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Performance - Random Matrices with 10% Density 
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Figure 3: BiCGStab performance on randomized matrices of varying sizes, all with 10% density. 



7 Performance on Probabilistic Model Checking Data 



For these tests, the implementations were tested on matrices representing the transition probability func- 
tions of Markov chains based on actual randomized algorithms. These matrices were then reduced to 



only 5? states and subtracted from the identity matrix, as discussed in Section 2.3 



JPF was used on two randomized algorithms to create this data. The biased die algorithm simulates 
a biased dice roll by means of biased coin flips, and random quick sort is a randomized version of the 
quick sort algorithm. Both algorithms were coded in Java by Zhang, who also created the JPF extension 
that outputs transition probabiUty matrices of Markov chains that correspond to the code being checked 

m. 

The JPF search strategies used were chosen to create the largest 5? matrices relative to the size of the 
searched space. JPF's built-in depth first search was used for random quick sort, and a strategy called 
probability first search that prioritizes high-probability transitions, also written by Zhang [14], was used 
for the biased die example. 

The results of the random quick sort tests are shown in Figure|4j and the results of the biased die tests 
in Figure |5] Error bars representing standard deviations of each data point are too small to be visible in 
the graphs. The matrix sizes in the random quick sort tests are much smaller than those in the biased die 
tests because, while the matrices for random quick sort are initially the same size as or larger than those 
produced for biased die, much fewer states belong to the 5? set. Furthermore, note that the densities of 
these matrices decrease as their sizes increase. The sizes and densities of the matrices used in these tests 
are shown in Table [T] 

The relative performances of the sequential and CUDA implementations of the Jacobi method, and 
the sequential implementation of the BiCGStab method, are different for each algorithm. The best per- 
formance on the random quick sort data was from the sequential implementation of the Jacobi method, 
while the best performance for biased die was from the CUDA implementation of the Jacobi method. 
The CUDA implementation of the BiCGStab method performs worst in both cases. 
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Figure 4: Performance on matrices output by JPF, while checking the random quick sort algorithm. 
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Figure 5: Performance on matrices output by JPF, while checking the biased die algorithm. 



7.1 Comparing Probabilistic Model Checking Data with Random Matrices 

For these tests, random matrices were generated with the same densities and sizes as those produced by 
JPF (as in Table [T]l. The JPF matrices are reduced transition probability matrices, subtracted from the 
identity matrix. So, they have entries in the interval [-1, 1] with locations based on the structure of a 
Markov chain. In contrast, entries in the randomized matrices are randomly-located positive integers, 
as described in Section [6] Unlike in the JPF matrix tests, each trial uses a different matrix and vector. 
However, the standard deviations of the times measured are still too small to be shown on the graphs. 

Performance results for these matrices are shown in Figure [6] and Figure [7] It is apparent that the 
ordering of the different implementations' performances is the same as it was for matrices of the same 
sizes and densities generated by JPF. This suggests that size and density are the main determinants of 
which implementation performs best on probabilistic model checking data, and whether CUDA will be 
beneficial, rather than other properties unique to the matrices found in probabilistic model checking. 

Generally, it appears that for the particular types of matrix encountered during probabilistic model 
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Table 1: Sizes and densities of the matiices produced by JPF for the two algorithms, n is the matrix 
dimension, and m is the number of non-zero entries. 
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Figure 6: Performance on randomized matrices, with the same sizes and densities as the matrices pro- 
duced by JPF when checking the random quick sort algorithm. Averages times are based on 40 matrices. 



checking, implementing the BiCGStab method in CUDA does not improve performance. CUDA does, 
however, improve the performance of the Jacobi method. 

In in, the authors conjecture that the Jacobi method would be more suitable for probabilistic model 
checking using CUDA than Krylov subspace methods such as the BiCGStab method. This seems to be 
true. However, the results in Section |6] suggest that this is due to the superior performance of the Jacobi 
method on sparse matrices in general, rather than BiCGStab's higher memory requirements as proposed 
in I^J. For the probabilistic model checking data, the matrix density decreases as the size increases, so 
the conditions in which the CUDA BiCGStab implementation performs best (larger, denser matrices, as 
in Figure [2]l are not encountered. 
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Figure 7: Performance on randomized matrices, with the same sizes and densities as the matrices pro- 
duced by JPF when checking the biased die algorithm. Average times are based on 40 matrices. 



8 Related and Future Work 

Related work by Bosnacki et al. [4] tests a CUDA implementation of the Jacobi method on data obtained 
by another probabilistic model checker, PRISM. They used different probabilistic algorithms in their 
probabilistic model checker, and for each algorithm their results show a benefit from GPU usage. In a 
later expansion of their research (SJ, they test a CUDA implementation of the lacobi method that uses 
a backward segmented scan, and one using two GPUs, which further improve performance. In their 
second paper they also compare the performance on 32- and 64-bit systems, and find that for one of 
the three algorithms they model check, the CPU algorithm on the 64-bit system outperforms the CUDA 
implementation. In another related paper, Bamat et al. ||2l improve the performance of the maximal 
accepting predecessors (MAP) algorithm for LTL model checking by implementing it using CUDA. 

Future work could include experiments on model checking data generated from the probabilistic 
algorithms used with the model checker in [4J, to allow closer comparison with that work. Another 
possibility is to implement the CUDA BiCGStab algorithm using multiple GPUs, so that different steps 
can be run in parallel. 

In [ 1 f], Kwiatkowska et al. suggest to consider the strongly connected components of the underlying 
digraph of the Markov chain in reverse topological order. Since the sizes and densities of the matrices 
corresponding to these strongly connected components may be quite different from the size and density 
of the matrix corresponding to the Markov chain, we are interested to see whether this approach will 
favour different implementations. 
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